Skip to content

[ci] split verify into 3 parallel jobs + aggregator (~7m -> ~2:45m)#72

Merged
trilamsr merged 2 commits into
mainfrom
ci/split-verify-3way
May 19, 2026
Merged

[ci] split verify into 3 parallel jobs + aggregator (~7m -> ~2:45m)#72
trilamsr merged 2 commits into
mainfrom
ci/split-verify-3way

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

Summary

The verify CI job currently runs 16 steps sequentially — 427s wall time on PR #70 (the most recent merged PR). Split into three parallel jobs grouped by the longest single step, with an aggregator named verify that preserves the existing branch-protection required check.

Partition

Job Steps Est. wall time
verify-test coverage-check (race tests) ~125s
verify-lint vet, lint ~155s
verify-static license-check, generate-check, build-tags, tidy-check, register-lint, nccl-fr RCE, actionlint, install+zizmor, ci-fuzz-nccl-fr, govulncheck, doc-check, build ~140s
verify (aggregator) needs: [verify-test, verify-lint, verify-static]; echo success ~5s

Wall time: max(125, 155, 140) ≈ 155s vs 427s today — saves ~4.5 minutes per PR.

Why this shape

  • coverage-check is the longest single step (~125s) — kept on its own job so it bounds the lower wall-time edge.
  • vet and lint share golangci-lint setup; pairing them keeps that overhead amortized.
  • verify-static is a grab-bag bounded by build (~55s) + fuzz (~40s). If it grows past the lint pole, rebalance later.

Branch protection

Unchanged. The required check verify continues to exist as the aggregator; sub-job failures propagate via needs:. No gh api calls or admin steps required.

Cost

Runner-minute cost rises from 1x to ~3.3x current (each job pays setup-go; Go module cache amortizes the rest). For a repo at this PR cadence the wall-clock win pays for itself in operator time.

Local invariant

make ci is unchanged — still sequential developer convenience. CI just spreads the same Make targets across 3 jobs.

Test plan

  • CI runs on this PR — verify-test, verify-lint, verify-static all green; verify aggregator green.
  • Wall time on this PR's CI run is observably lower than PR [chore] pr-review-loop: sync PR title and body in phase 5 #70 (~7m → ~2:45m target).
  • Aggregator failure mode: not exercised in this PR (would require an intentional break) — verified by inspection of needs: semantics.
  • make actionlint and make zizmor both clean locally on the diff.

trilamsr added 2 commits May 18, 2026 22:20
The verify job ran 16 steps sequentially for 427s wall time on PR
#70. Split into three parallel jobs grouped by the longest pole:

- verify-test:    coverage-check (race tests)   ~125s
- verify-lint:    vet + lint                    ~155s
- verify-static:  license/generate/build-tags/tidy/register-lint/
                  nccl-fr RCE/actionlint/zizmor/fuzz/govulncheck/
                  doc-check/build                ~140s

A final aggregator job `verify` declares `needs: [verify-test,
verify-lint, verify-static]` and echoes success. Branch protection
already requires `verify`; keeping the name means zero protection
churn while the three sub-jobs surface as informational rows.

Wall time drops from ~7m to ~2:45m (bounded by verify-lint).
Runner-minute cost is ~3.3x current — each job pays setup-go cache.

`make ci` is unchanged — still sequential developer convenience.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds one sentence to the partition comment block so a contributor
adding a new lint/test/scanner gate knows the default bucket
(verify-static) and the rebalance trigger (when it pushes
verify-static past the verify-lint pole). Surfaced during review of
the parallelization change.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 19, 2026 05:32
@trilamsr trilamsr merged commit 7375cef into main May 19, 2026
8 checks passed
@trilamsr trilamsr deleted the ci/split-verify-3way branch May 19, 2026 05:36
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Two root-cause fixes plus two compressions to the `pr-review-loop`
skill.

**Drops the ralph-loop plugin dependency.** The old path passed the full
~17 KB prompt body through `/ralph-loop:ralph-loop` as `$ARGUMENTS`,
which Claude Code's permission classifier aborted with "parser timeout,
resource limit, or over-length" on real-world invocation (caught when I
tried to invoke the skill on PR #72). The documented manual-paste
fallback hits the same classifier limit and was itself a dead end. The
skill now orchestrates the 5 phases directly in the current session via
`Agent` tool dispatches; no plugin invocation.

**Adds a diff-size + path precondition.** Under 200 changed lines AND
all paths in `*.md` / `.github/**` / `docs/**` / `.claude/**`
auto-offers a single-subagent review instead of paying for ~14 subagent
dispatches and 5 review-pass commits. The skill's own "When to run /
Skip" guidance already named these as the wrong fit; the gate now
enforces it. Operator opts in explicitly to keep the heavy loop.

**Cuts.** The two-form `M<X>` / `(no-milestone)` substitution rule
(resolved in the precondition step now, not in the prompt body). The
enumerated MEMORY.md slug list (slugs are already in context via
auto-memory). The reviewer-brief's motivational prose. The duplicate
stopping-rule block (merged with acceptance into one canonical
checklist).

**Keeps.** All 5 phase definitions (Phase 1 self → Phase 2
stakeholder-lens parallel → Phase 3 adversarial → Phase 4 A+ aspiration
→ Phase 5 simplify + PR title/body sync). Validation cycle. TDD
discipline. Reviewer-brief output format. Rubric evolution flow.
Final-report format. Readiness-audit gate before completion promise.

Net diff: -131 lines (-347/+216), 670 → 538 line file. The bigger change
is structural: the skill is now self-contained instead of delegating to
a third-party plugin.

## Test plan

- [ ] Skim the precondition block — `#4` correctly auto-routes small
CI/docs/`.claude/` PRs to a single-subagent path and only continues to 5
phases on explicit operator confirmation.
- [ ] Skim "How the loop runs" — direct orchestration is clear, no
leftover ralph-loop invocation language.
- [ ] Spot-check that all 5 phase definitions are intact and reference
the same `<reviewer-brief>`.
- [ ] Confirm the merged acceptance/stopping-rule block still gates the
completion promise on every prior criterion (PR title sync, AI-vocab
clean, `make ci` green, `gh pr checks` green, DCO trailers, no `main`
commits).
- [ ] `make doc-check` clean on this branch (banned-phrase lint, link
resolution, no growth in `(unverified)` count).
- [ ] On the next `/pr-review-loop` invocation against a real PR,
verify: (a) milestone auto-resolves from the branch name; (b)
small/low-risk diff triggers the single-subagent prompt; (c) explicit
override starts the 5 phases without trying to invoke
`/ralph-loop:ralph-loop`.

## Out of scope

Two items from the optimization sketch were not in this PR's ask and
remain as follow-ups: dropping the rubric-evolution machinery if it
isn't load-bearing in practice, and a lightweight sibling skill
(`/pr-review`) for the everyday case.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

PR #72's verify split introduced a silent branch-protection bypass. The
aggregator job uses default `needs:` semantics, so any sub-job failure
(`verify-test`, `verify-lint`, or `verify-static`) results in the
aggregator being SKIPPED. GitHub treats SKIPPED required checks as
satisfied, which means any failing sub-job allowed the PR to merge as if
all checks passed.

Caught on PR #73, which merged at 05:58:35Z despite `verify-test`
failing at 05:57:19Z. The required check `verify` showed as SKIPPED
rather than FAILURE, and protection treated the SKIPPED state as a pass.

## Fix

The aggregator now runs `if: always()` (so it runs even when sub-jobs
fail) and explicitly inspects each `needs.<sub-job>.result`, exiting
non-zero when any sub-job did not return `success`. Each non-success
result emits a `::error::` annotation naming the offending sub-job for
operator debuggability.

```yaml
verify:
  needs: [verify-test, verify-lint, verify-static]
  if: always()
  steps:
    - run: |
        fail=0
        if [ "${{ needs.verify-test.result }}" != "success" ]; then ...
        ...
        [ "$fail" -eq 1 ] && exit 1
        echo "all verify-* gates passed"
```

## Why it matters

Without this fix, the entire branch protection on `main` is effectively
trivial — any flaky sub-job silently lets a merge through. The hole was
open for the window between PR #72 (merged ~05:36Z) and this fix.

## Test plan

- [ ] `make actionlint` clean ✅
- [ ] `make zizmor` clean ✅ (no findings, suppressed count went up
because `${{ needs.*.result }}` is correctly recognized as a safe
template expression).
- [ ] CI on this PR exercises the new aggregator path: all three
sub-jobs run, aggregator inspects results. If all pass, aggregator
passes. If any fails, aggregator will FAIL (the previous behavior would
have skipped here).
- [ ] After merge, the next PR that hits a sub-job failure (e.g., a
flaky test) will correctly block on `verify` reporting FAILURE.

## Out of scope

The `TestReceiver_SLIBudget` flake observed on PR #73 (`p99 539ms >
500ms target` under race-on runner contention) is unrelated to this fix
and worth a separate follow-up — either widen the budget under race or
skip the SLI assertion in race mode.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

After PR #72's verify split (wall 427s → 242s actual, vs 155s
projected), investigated whether setup-go cache sharing could close the
gap. It can't: all three `verify-*` jobs already share the same cache
key and hit it. The gap is `verify-test` bounded at 235s by `make
coverage-check` (race-on `go test ./...` across 50+ packages). Records
two follow-ups in `docs/FOLLOWUPS.md` so future work doesn't re-chase
setup-go:

1. `coverage-check` sharding across multiple jobs — real wall-time win
(~150s target) but a real refactor (split test execution + coverage
aggregation). Triggered by PR throughput becoming binding.
2. Move `build-tags` and `build` from verify-static into verify-lint —
saves ~17% runner-minutes via dedupe of compile work, zero wall-time.
Cheap, but no speed benefit; triggered by runner-minute pressure.

## Test plan

- [x] `make doc-check` clean (banned-phrase, link-resolution,
`(unverified)` baseline).
- [ ] FOLLOWUPS-only PR; no code or workflow changes.

```release-notes
No user-facing change. Records a CI wall-time investigation under docs/FOLLOWUPS.md so future agents don't re-chase setup-go cache sharing as a speedup lever.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Adds two load-bearing lessons to `AGENTS.md` from this session's CI
work. Both prevent a future contributor from repeating the same trap.

**Aggregator bypass.** GitHub Actions short-circuits an aggregator job's
`needs:` to SKIPPED on any sub-job failure, and treats SKIPPED required
checks as satisfied. PR #73 silently merged past a failed `verify-test`
because the aggregator from PR #72's verify split was SKIPPED rather
than FAILURE. The fix shape (`if: always()` + `needs.*.result` check)
shipped in PR #74; this lesson documents the trap and the fix so anyone
splitting CI jobs in the future doesn't repeat it.

**Perf-budget regex flake class.** `require.Regexp` with implicit upper
bounds (e.g. `0\.0[0-9]+`) on values whose only invariant is `>0` flake
on slow CI runners. Two of these hit in one session:
`TestReceiver_SLIBudget` (emit-latency, observed 539ms) and
`TestReceiver_SetDegraded` (degraded-seconds, observed 0.126s). The fix
shape is the same in both — relax to any positive value
(`\d+\.[0-9]*[1-9]`) or use baseline-relative comparisons.

File goes from 128 to 148 lines (cap is 150, with 2 lines of remaining
headroom — next addition should consider demoting an older entry to a
topic note per the file's own promotion rule).

## Test plan

- [x] `wc -l AGENTS.md` reports 148, under the 150-line cap.
- [x] `make doc-check` clean (banned-phrase lint, 250 links resolve,
`(unverified)` count = 7 baseline).
- [x] Capture-flow format check (`learn-from-mistakes` skill): banned
vocabulary absent, no first-person AI phrasing, no AI attribution, both
entries carry `Anchor:` citations.
- [ ] CI on this PR exercises the same gates plus the aggregator that's
now itself an anchor of lesson #1.

```release-notes
NONE — documentation only. Adds two load-bearing lessons to `AGENTS.md` covering GitHub Actions aggregator semantics and a recurring perf-budget regex flake class. No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Three observations from the recent session that didn't fit the
structured surfaces (already used for the load-bearing AGENTS.md entries
in PR #81 and the agent-internal notes in PR #82). Each captured via the
`learn-from-mistakes` flow and lands in its existing topic note.

**`.claude/notes/automation.md`** — *Memory captures rationale; hooks
enforce.* The pre-PR checklist personal memory landed mid-session was
followed by a lint failure shipped to CI within the hour. The same gap
closed reliably by the `PreToolUse` hook installed shortly after. For
any "always do X before Y" pattern, prefer the hook; the memory
documents *why* the hook exists.

**`docs/notes/ci.md`** — *Frame CI / perf projections as ranges, not
single numbers.* PR #72's 155s wall-time projection vs 242s actual cost
an investigation round (later landed in PR #77) because the projection's
setup-go-cache amortization assumption was unverified. Either verify
assumptions empirically before publishing the number, or frame as a
range.

**`.claude/notes/review-patterns.md`** — *Self-rate work, then write
criteria for the next grade up.* Forces articulation of measurable
improvements rather than free-form "anything else?". PR #76's B+ → A →
A+ came from two iterations of this exact pattern; each iteration closed
real structural gaps.

A fourth lesson — "fix existing tools before proposing new ones" — was
captured to personal memory (no PR, lives in
`~/.claude/projects/.../memory/`), not the repo, because it's a judgment
heuristic about my own decision-making rather than a repo-resident
convention.

## Test plan

- [x] `make doc-check` clean (banned-phrase lint, link resolution, all
gates).
- [x] `learn-from-mistakes` format check: banned vocabulary absent, no
first-person AI phrasing, no AI attribution, all three entries carry
`Anchor:` lines pointing at concrete PRs.
- [ ] CI on this PR exercises `doc-check` + `pr-lint`.

## Rollback

Each entry is a self-contained `### title` + body + `Anchor:` block at
the top of its file. No dependents elsewhere; reverting is a single Edit
per file.

```release-notes
NONE — documentation only. Three meta-lessons from a recent session retrospective land in their existing topic notes (`automation.md`, `ci.md`, `review-patterns.md`). No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant